Anonymization with Worst-Case Distribution-Based Background Knowledge 
Abstract 

Background knowledge is an important factor in privacy
preserving data publishing. Distribution-based background
knowledge is one well-studied kind of background knowledge.
However, to the best of our knowledge, no existing work
considers distribution-based background knowledge in the
worst-case scenario, by which we mean that the adversary has
accurate knowledge about the distribution of sensitive values
according to some tuple attributes. Considering this worst-case
scenario is essential because we cannot overlook any breaching
possibility. In this paper, we propose an algorithm that
anonymizes a dataset to protect individual privacy under this
background knowledge. We prove that the anonymized datasets
generated by our proposed algorithm protect individual privacy.
Our empirical studies show that our method also preserves high
utility for the published data.



1 Introduction 

Privacy preserving data publishing is an important topic
in the literature of privacy for very pragmatic reasons. As
an example, AOL did not take sufficient precautions and
encountered some undesired consequences. A dataset of
search logs was published in 2006. Later AOL realized that
a single 62-year-old woman living in Georgia could be
re-identified from the search logs by some New York Times
reporters. The search logs were withdrawn and two employees
responsible for releasing the search logs were fired.

Example 1 (Data Publishing) Suppose a table T like Table 1
is to be anonymized for publication. Table T has two
kinds of attributes, (1) the quasi-identifier (QI) attributes 
and (2) the sensitive attribute. (1) The QI attributes can be 



used as an identifier in the table. In our example, the QI
attributes are Nationality and Zipcode. Attribute Name is just
for discussion and is not used for publication. [17] points
out that in a real dataset, about 87% of individuals can be
uniquely identified by some QI attributes with a publicly
available external table such as a voter registration list.¹ An
example of a voter registration list is shown in Table 2. (2)
The sensitive attribute contains some sensitive values. In
our example, the sensitive attribute is "Disease", containing
sensitive values such as Heart Disease and HIV. Assume
that each tuple in the table is owned by an individual and
each individual owns at most one tuple.

Our target is to anonymize T and publish the
anonymized dataset T*, like Table 3, to satisfy some privacy
requirements. A typical anonymization is described
as follows. T is horizontally partitioned into multiple tuple
groups. Let P be a resulting group. We give a unique ID
called GID to P, and all tuples in P are said to have the same
GID value. An anonymization defines a function β on each
P to form an anonymized group (in short, A-group) such
that the linkage between the QI attributes and the sensitive
attribute in the A-group is broken. One way to break the
linkage is bucketization, which forms two tables, called the QI
table (Table 3(a)) and the sensitive table (Table 3(b)): P is
projected on all QI attributes and attribute GID to form the
QI table, and on the sensitive attribute and attribute GID to
form the sensitive table. Therefore, a table T is anonymized
to a dataset T* if T* is formed by first partitioning T into
a number of groups, then forming an A-group from each
partition by β, and finally inserting each A-group into T*.
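The bucketization step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `bucketize`, the dictionary-based row layout and the `group_of` argument (the GID chosen for each tuple) are assumptions made for the example, and sorting the sensitive table is an extra precaution added here so that row order does not reveal the original pairing.

```python
def bucketize(rows, qi_attrs, sensitive_attr, group_of):
    """Project each tuple onto a QI table and a sensitive table.

    rows       : list of dicts, one per tuple of T
    group_of   : list giving the GID assigned to each tuple (hypothetical input)
    The two output tables share only the GID column.
    """
    qi_table, sensitive_table = [], []
    for row, gid in zip(rows, group_of):
        qi_table.append(tuple(row[a] for a in qi_attrs) + (gid,))
        sensitive_table.append((gid, row[sensitive_attr]))
    # Assumption of this sketch: order the sensitive table by (GID, value)
    # so that row positions carry no information about the original pairing.
    sensitive_table.sort()
    return qi_table, sensitive_table


# The running example (Table 1) with the partition used in Table 3.
table_1 = [
    {"Nationality": "American", "Zipcode": "55501", "Disease": "Heart Disease"},
    {"Nationality": "Japanese", "Zipcode": "55502", "Disease": "Flu"},
    {"Nationality": "Japanese", "Zipcode": "55503", "Disease": "Flu"},
    {"Nationality": "Japanese", "Zipcode": "55504", "Disease": "Stomach Virus"},
    {"Nationality": "French", "Zipcode": "66601", "Disease": "HIV"},
    {"Nationality": "Japanese", "Zipcode": "66601", "Disease": "Diabetes"},
]
gids = ["L1", "L1", "L2", "L2", "L3", "L3"]
qi_table, sensitive_table = bucketize(
    table_1, ["Nationality", "Zipcode"], "Disease", gids
)
```

Within a group, the published data then only says which set of QI tuples shares which multiset of sensitive values.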



¹ There are many sources of such an external table. Most
municipalities sell population registers that include the
identifiers of individuals along with basic demographics;
examples include local census data, voter lists, city
directories, and information from motor vehicle agencies, tax
assessors, and real estate agencies [15]. In [17], it is reported
that a city's voter list on two diskettes was purchased for
twenty dollars and was used to re-identify medical records.



Name    Nationality   Zipcode   Disease
Alex    American      55501     Heart Disease
Bob     Japanese      55502     Flu
Chris   Japanese      55503     Flu
David   Japanese      55504     Stomach Virus
Emily   French        66601     HIV
Fred    Japanese      66601     Diabetes

Table 1. An example



Name    Nationality   Zipcode
Alex    American      55501
Bob     Japanese      55502
Chris   Japanese      55503
David   Japanese      55504
Emily   French        66601
Fred    Japanese      66601

Table 2. Voter registration list



Nationality   Zipcode   GID
American      55501     L1
Japanese      55502     L1
Japanese      55503     L2
Japanese      55504     L2
French        66601     L3
Japanese      66601     L3

(a) QI table

GID   Disease
L1    Heart Disease
L1    Flu
L2    Flu
L2    Stomach Virus
L3    HIV
L3    Diabetes

(b) Sensitive table

Table 3. A 2-diverse dataset anonymized from Table 1



For example, Table 1 is anonymized to Table 3 by bucketization.
Such an anonymization is commonly adopted in the
literature of data publishing [20, 21, 14, 18, 10].

There are many privacy models in the literature, such as
k-anonymity [17], l-diversity [13], t-closeness [9], (k,e)-
anonymity [21], Injector [10] and m-confidentiality [18].
For illustration, let us consider a simplified setting of the
l-diversity model [13] as a privacy requirement for published
data T*. An A-group is said to be l-diverse, or to satisfy
l-diversity, if in the A-group the number of occurrences of
any sensitive value is at most 1/l of the group size. A table
satisfies l-diversity (or it is l-diverse) if all A-groups in
it are l-diverse. Suppose that Table 1 is anonymized to
Table 3. Consider the A-group with GID equal to L1, which
corresponds to the first two tuples in the QI table (Table 3(a))
and the first two tuples in the sensitive table (Table 3(b)). In
the following, we simply refer to the A-group with GID
equal to L1 by L1. Since L1 contains two tuples, the group
size of L1 is equal to 2. Since the number of occurrences
of any sensitive value (i.e., 1) is at most 1/2 of the group
size, L1 satisfies 2-diversity. Similarly, L2 and L3 satisfy
2-diversity. Thus, Table 3 satisfies 2-diversity. The intention
of 2-diversity is that each individual cannot be linked
to a disease with a probability of more than 0.5 without any
additional background knowledge.
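The simplified l-diversity test used in this paragraph is easy to state as code. The following is a small sketch under the definition given above (no sensitive value may account for more than 1/l of an A-group); the function name `is_l_diverse` and the list-of-pairs layout of the sensitive table are assumptions of the example.

```python
from collections import Counter


def is_l_diverse(sensitive_table, l):
    """Check the simplified l-diversity condition on a bucketized table.

    sensitive_table : iterable of (GID, sensitive value) pairs
    Returns True if, in every A-group, each sensitive value occurs
    in at most 1/l of the tuples of that group.
    """
    groups = {}
    for gid, value in sensitive_table:
        groups.setdefault(gid, []).append(value)
    return all(
        max(Counter(values).values()) * l <= len(values)
        for values in groups.values()
    )


# Sensitive table of Table 3(b).
table_3b = [
    ("L1", "Heart Disease"), ("L1", "Flu"),
    ("L2", "Flu"), ("L2", "Stomach Virus"),
    ("L3", "HIV"), ("L3", "Diabetes"),
]
print(is_l_diverse(table_3b, 2))  # every group has two distinct values
print(is_l_diverse(table_3b, 3))  # groups of size 2 cannot be 3-diverse
```

The check looks only at value counts inside each group, which is exactly why it cannot account for an adversary who knows how likely each value is for each individual.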

However, this table does not protect individual privacy
sufficiently if we consider background knowledge.

Example 2 (Background Knowledge) Consider L1 in
Table 3. In L1, Heart Disease and Flu are values of the
sensitive attribute Disease. Since most individuals can be
re-identified by the QI attributes with a publicly available
external table such as a voter registration list [17], if we are
given the voter registration list shown in Table 2, it is
easy to figure out that the two tuples in L1 correspond to
Alex and Bob. From L1, it seems that each of the two
individuals, Alex and Bob, in this group has a 50% chance of
being linked to Heart Disease (Flu). The chance
is interpreted as 50% because the analysis is based on this
group alone, without any additional information.

Suppose we are given a probability distribution as shown
in Table 4. The distribution of attribute set {"Nationality"}
consists of the probabilities that a Japanese, an American or
a French person is linked to "Heart Disease" (and "Not Heart
Disease"). For example, the probability that an American is linked
to Heart Disease is 0.1 and the probability that a Japanese
is linked to Heart Disease is 0.003. With this distribution,
the adversary can say that Bob, being Japanese, has a lower
chance of having Heart Disease. S/he can deduce that Alex,
being American, has a higher chance of having Heart
Disease. The intended 50% threshold is thus violated.

Hence background knowledge has an important impact on
privacy preserving data publishing. Recent works [9, 12, 14,
6, 18] have started to focus on modeling background knowledge.
Distribution-based background knowledge is one well-known
type of background knowledge, and it is used in the
state-of-the-art privacy model, t-closeness.
Distribution-based background knowledge [9, 12] is the information
related to the distribution of sensitive information in data.
There are at least two kinds of distribution-based background
knowledge, namely dataset based distribution and
QI based distribution. The dataset based distribution is the
distribution of the values in the sensitive attribute according
to the entire dataset [9]. The QI based distribution is the
distribution of the values in the sensitive attribute restricted
to individuals with the same values on some QI attributes
[12].

Example 3 (Distribution-based background knowledge)
Suppose that there are 100,000 individuals in the dataset T,
6,000 of whom are linked to "Heart Disease". The
probability that an individual t in the dataset is linked to
"Heart Disease" is 0.06. The dataset based distribution has
been considered by [9].

In this paper we consider the QI based distribution [12].
Some well-known examples of such knowledge are the facts
that Japanese seldom suffer from Heart Disease [13] and that
a male individual cannot be linked to ovarian cancer [10]. For
example, the distribution of the sensitive attribute according
to Japanese may be encoded as {(Japanese:"Heart Disease",
0.003), (Japanese:"Flu", 0.21), ...} where (Japanese:x, p)
denotes that the probability that a Japanese is linked to a
value x is p.
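To make the difference between the two kinds of distribution concrete, the sketch below computes both from a toy list of records. It is only an illustration of the definitions: in the paper the adversary's QI based distribution is background knowledge and need not be computed from T, and the function names and the four sample records are hypothetical.

```python
from collections import Counter


def dataset_based_distribution(rows, sensitive_attr):
    """Fraction of all tuples linked to each sensitive value."""
    counts = Counter(row[sensitive_attr] for row in rows)
    return {value: c / len(rows) for value, c in counts.items()}


def qi_based_distribution(rows, qi_attrs, sensitive_attr):
    """Fraction of tuples linked to each value, per combination of QI values.

    Keys are (signature, value) pairs, where the signature is the tuple of
    values on qi_attrs, mirroring entries such as (Japanese:"Flu", p).
    """
    signature_counts = Counter(tuple(r[a] for a in qi_attrs) for r in rows)
    pair_counts = Counter(
        (tuple(r[a] for a in qi_attrs), r[sensitive_attr]) for r in rows
    )
    return {
        (sig, value): c / signature_counts[sig]
        for (sig, value), c in pair_counts.items()
    }


# Hypothetical toy records, not taken from the paper.
records = [
    {"Nationality": "Japanese", "Disease": "Flu"},
    {"Nationality": "Japanese", "Disease": "Flu"},
    {"Nationality": "Japanese", "Disease": "Heart Disease"},
    {"Nationality": "American", "Disease": "Heart Disease"},
]
overall = dataset_based_distribution(records, "Disease")
per_nationality = qi_based_distribution(records, ["Nationality"], "Disease")
```

In this toy data the overall probability of "Heart Disease" is 0.5, while it is 1/3 among Japanese and 1 among Americans, which is the kind of gap the QI based distribution exposes.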

            Heart Disease   Not Heart Disease
American    0.1             0.9
Japanese    0.003           0.997
French      0.05            0.95

Table 4. A QI based distribution of attribute
"Nationality" for our motivating example



If the QI based background knowledge is accurate, we
say that we have the worst-case scenario. Considering the
worst-case scenario is essential in data publishing [14, 6,
18] because it gives the maximal protection [11]. To the best
of our knowledge, there is no existing work considering the
worst-case QI based distribution.

There is only one work [12] closely related to ours. However,
[12] considers the QI based distribution background
knowledge with uncertainty. Specifically, in [12], the
uncertainty of the background knowledge is denoted by an
input parameter B. Conceptually, if B is equal to 0, then
the adversary has the clearest understanding of the background
knowledge, which corresponds to the worst-case
background knowledge. However, if B is set to 0 in the
model proposed by [12], then the background knowledge
is undefined. Also, [12] adopts a brute-force approach in
the anonymization by checking the breaching probability
of anonymized groups. This approach has two disadvantages.
The first problem is that the breaching probability
is hard to compute, and therefore approximation is needed in
their method, which sacrifices correctness. The second
problem is that the breaching probability is not monotone, in
that an A-group that violates privacy may be split into two
groups that preserve privacy. Therefore, even though
Mondrian [8] is adopted as their algorithm, it does not guarantee
an optimal solution in spite of the exhaustive search
in each iteration of the top-down processing. Our solution
overcomes both of these problems.

Building on previous works, we propose a new method
to handle the worst-case background knowledge. The
essence of our method is the following. We observe that
privacy is breached whenever an individual in an A-group
has a much higher chance of linking to a sensitive value
compared with another individual in the A-group according
to the QI based distribution. Based on this observation,
we propose a solution which generates a dataset such that
all individuals in each A-group have "similar" chances of
linking to any sensitive value in the group, according to the
distribution. For example, suppose we form a group with an
American and a Canadian, linked to heart disease and flu, and
suppose the probabilities of Americans and Canadians being
linked to heart disease and to flu are similar. Since they
have "similar" chances, it is not possible for the adversary
to pinpoint any linkage of an individual to a sensitive value
with a higher chance. At the same time, our method can
maintain high utility for the published table.

Our contributions can be summarized as follows. Firstly,
to the best of our knowledge, we are the first to handle the
worst-case QI based distribution. Secondly, we derive an
interesting and useful theoretical property and, based on this
property, we propose an algorithm which generates a dataset
protecting individual privacy in the presence of the
worst-case QI based distribution. Finally, we have conducted
experiments which show that our proposed algorithm is
efficient and incurs low information loss.

2 Problem Definition 

Let T be a table. We assume that one of the attributes is
a sensitive attribute X where some values of this attribute
should not be linkable to any individual. These values are
called sensitive values. The value of the sensitive attribute
of a tuple t is denoted by t.X. A quasi-identifier (QI) is
a set of attributes of T, $A_1, A_2, ..., A_q$, that may serve as
identifiers for some individuals. Each tuple in the table T
is related to one individual and no two tuples are related to
the same individual. With publicly available voter registration
lists (like Table 2), the QI values can often be used to
identify a unique individual [17, 15].

There are two common approaches for anonymization,
which generates T* from T. One is generalization, which
generalizes all QI values in each A-group to the same value.
The other is bucketization, which we have illustrated in the
previous section. For ease of illustration, we focus on
bucketization. The discussion for generalization is similar.
With anonymization, there is a mapping which maps each
tuple in T to an A-group in T*. For example, the first tuple
$t_1$ in Table 1 is mapped to A-group L1.

The aim of privacy preserving data publishing is to deter 
any attack from the adversary on linking an individual to 
a certain sensitive value. Specifically, the data publisher 
would try to limit the probability of such a linkage that can 
be established. 

In the literature (e.g., [20, 10]), it is assumed that
the knowledge of an adversary includes (1) the published
dataset T*, (2) a publicly available external table $T^e$ such as
a voter registration list that maps QIs to individuals [17, 15],
and (3) some background knowledge. We also follow these
assumptions in our analysis. We focus on the QI based
distribution as background knowledge.

The QI based distribution for the attribute set
{"Nationality"} is described in Table 4. Each probability in
the table is called a global probability. The sample space
for each such discrete probability distribution consists of the
possible assignments of the sensitive values such as x to an
individual with the particular nationality. For nationality s,
the sample space is denoted by $\Omega_s$.

Symbol            Meaning
L_k               an A-group (anonymized group) in the anonymized dataset
A                 set of attributes, e.g. {"Nationality", "Zipcode"}
t_1, ..., t_N     tuples in an A-group
s_1, ..., s_m     signatures for A, e.g. {American, 55501};
                  multiple tuples t_j can map to the same s_i
p(t_j : x)        probability that tuple t_j is linked to value x
p(s_i : x)        probability that signature s_i is linked to x
f_i               a simplified notation for p(s_i : x)
w                 a possible world: an assignment of the tuples
                  in A-group L_k to the sensitive values in L_k
W_k               set of all possible worlds w for L_k
W_k^(t_j:x)       set of all possible worlds w in W_k
                  in which t_j is assigned value x
p(w|L_k)          probability that w occurs given A-group L_k
p_{j,w}           the probability that t_j is linked to a value in the
                  sensitive attribute as specified in w

Table 6. Notations

Each possible value in attribute "Nationality" in our
example is called a signature. There are three possible
signatures in our example: "Japanese", "American" and
"French". In general, there can be other attribute sets,
such as {"Nationality", "Zipcode"}, with their corresponding
QI based distributions. We define the signature and the
QI based distribution for a particular attribute set A as
follows.

Consider a QI attribute set A with q attributes $A_1, ..., A_q$.
A signature s of A is a set of attribute-value pairs
$(A_1, v_1), ..., (A_q, v_q)$ which appear in the published dataset
T*, where $A_i$ is a QI attribute and $v_i$ is a value. A tuple t
in T* is said to match s if $t.A_i = v_i$ for all i = 1, 2, ..., q.
For example, a signature s can be {("Nationality",
"American"), ("Zipcode", "55501")} if the attribute set A is
{"Nationality", "Zipcode"}. For convenience, we often
drop the attribute names, and thus we have {"American",
"55501"} for the above signature. The first tuple in
Table 3(a) matches {"American"} but the second does not.
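The matching rule can be written directly from this definition. The sketch below assumes that tuples and signatures are both represented as dictionaries from attribute name to value; the helper names `matches` and `signature_of` are illustrative and do not come from the paper.

```python
def matches(tuple_row, signature):
    """True if the tuple agrees with the signature on every attribute in it."""
    return all(tuple_row.get(attr) == value for attr, value in signature.items())


def signature_of(tuple_row, attribute_set):
    """The signature of a tuple for a given attribute set A."""
    return {attr: tuple_row[attr] for attr in attribute_set}


# First two tuples of the QI table, Table 3(a).
first = {"Nationality": "American", "Zipcode": "55501", "GID": "L1"}
second = {"Nationality": "Japanese", "Zipcode": "55502", "GID": "L1"}

american = {"Nationality": "American"}
print(matches(first, american))    # the first tuple matches {"American"}
print(matches(second, american))   # the second tuple does not
print(signature_of(first, ["Nationality", "Zipcode"]))
```

Several tuples can share one signature, so the global probabilities of a QI based distribution are looked up per signature rather than per tuple.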

Given an attribute set A, the QI based distribution G of
A contains a set of entries (s : x, p) for each possible
signature s of A, where p is equal to p(s : x), which denotes
the probability that a tuple matching signature s is linked
to x. For example, G may contain ("Japanese":"Heart
Disease", 0.003) and ("American":"Heart Disease", 0.1). This
involves two sample spaces, $\Omega_{Japanese}$ and $\Omega_{American}$.

Definition 1 (r-robustness) Given the QI based distribution,
a dataset T* is said to satisfy r-robustness (or T* is
r-robust) if, for any individual t and any sensitive value x,
the probability that t is linked to x, p(t : x), does not exceed
1/r.

We will discuss the sample space for p(t : x) and
derive a formula for p(t : x) in Section 3. In this paper, we
study the following problem: given a dataset T, generate
an anonymized dataset T* from T which satisfies
r-robustness and at the same time minimizes the information
loss. There have been different definitions of information
loss in the literature. In our experiments, we shall adopt the
measurement of accuracy in query results from T* versus
that from T.

3 Probability Formulation 

For the sake of illustration, in this section, we consider a
certain attribute set A and a sensitive value x. We will
consider any attribute set and any sensitive value in Section 4.

Suppose there are m possible signatures for attribute set
A, namely $s_1, s_2, ..., s_m$. Let G be the background
knowledge consisting of the set of all QI based distributions. In
G, the probability that $s_i$ is linked to a sensitive value x is
given by $p(s_i : x)$.

Given G, the formula for p{t : x), the probability that a 
tuple t is linked to sensitive value x, is derived below. 

In the following, we consider the anonymized dataset
T*. Suppose t belongs to A-group $L_k$ in T*. For ease
of reference, we summarize the notations that we use in
Table 6. We shall need the following definitions.

Definition 2 (Possible World) Consider an A-group $L_k$
with N tuples, namely $t_1, t_2, ..., t_N$, with corresponding
values in sensitive attribute X of $\gamma_1, \gamma_2, ..., \gamma_N$. A possible
world w for $L_k$ is a possible assignment mapping the tuples
in set $\{t_1, t_2, ..., t_N\}$ to values in multi-set $\{\gamma_1, \gamma_2, ..., \gamma_N\}$
in $L_k$.

Suppose we are given an A-group $L_k$ with a set of tuples and a
multi-set of the values in X. Considering all possible worlds, we
form a sample space. More precisely, the sample space
$\Omega_{w|L_k}$ consists of all the possible assignments of the
sensitive values in $L_k$ to the N tuples in $L_k$. For each such
possible world w, according to the QI based distribution G
based on attribute set A, we can determine the probability
$p(w|L_k)$ that w occurs given $L_k$.

Definition 3 (Primitive Events, Projected Events) A
mapping t : x from an individual or tuple t to a value x
of the sensitive attribute is called a primitive event.
Suppose t matches signature s. Let us call an event for the
corresponding signature, "s : x", a projected event for t.
Note that this projected event belongs to sample space $\Omega_s$.

A primitive event is an event in the sample space $\Omega_{w|L_k}$.
The probability of such an event, p(t : x), is the probability
of interest for the adversary. The probability of the
projected event, p(s : x), is in the QI based distribution G.

Similar to [13, 20, 18], we assume that the linkage of a
value in X to an individual is independent of the linkage of
a value in X to another individual. For example, whether
an American suffers from Heart Disease is independent of
whether a Japanese suffers from Heart Disease. Thus, for a
possible world w for $L_k$, the probability that w occurs given
$L_k$ is proportional to the product of the probabilities of the
corresponding projected events for the tuples $t_1, ..., t_N$ in $L_k$.
We shall denote this product as p(w):

\[ p(w) = p_{1,w} \times p_{2,w} \times \cdots \times p_{N,w} \tag{1} \]

where $p_{j,w}$ is the probability that $t_j$ is linked to a value
in the sensitive attribute specified in w. Suppose $t_j$ matches
signature $s_i$. If $t_j$ is linked to x in w, then $p_{j,w} = p(s_i : x)$.

p()   x     y
s1    0.5   0.5
s2    0.2   0.8

(a) Conditional distribution

w    t1 (s1)  t2 (s1)  t3 (s2)  t4 (s2)  p(w)                          p(w|L_k)
w1   x        x        y        y        0.5 x 0.5 x 0.8 x 0.8 = 0.16  0.16/0.33 = 0.48
w2   x        y        x        y        0.5 x 0.5 x 0.2 x 0.8 = 0.04  0.04/0.33 = 0.12
w3   x        y        y        x        0.5 x 0.5 x 0.8 x 0.2 = 0.04  0.04/0.33 = 0.12
w4   y        x        x        y        0.5 x 0.5 x 0.2 x 0.8 = 0.04  0.04/0.33 = 0.12
w5   y        x        y        x        0.5 x 0.5 x 0.8 x 0.2 = 0.04  0.04/0.33 = 0.12
w6   y        y        x        x        0.5 x 0.5 x 0.2 x 0.2 = 0.01  0.01/0.33 = 0.03

(b) p(w) and p(w|L_k)

Table 5. An example illustrating the computation of p(t_j : x)

Let the set of all the possible worlds for $L_k$ be $W_k$. The
sum of probabilities of all the possible worlds given $L_k$
must be 1, since they form the sample space $\Omega_{w|L_k}$. Hence,
the probability of w given $L_k$ is obtained by normalizing p(w).
For $w \in W_k$, we have

\[ p(w|L_k) = \frac{p(w)}{\sum_{w' \in W_k} p(w')} \tag{2} \]



Our objective is to find the probability that an individual
$t_j$ in $L_k$ is linked to a sensitive value x. This is given by the
sum of the probabilities $p(w|L_k)$ of all the possible worlds
w where $t_j$ is linked to x:

\[ p(t_j : x) = \sum_{w \in W_k^{(t_j : x)}} p(w|L_k) \tag{3} \]

where $W_k^{(t_j : x)}$ is the set of all possible worlds w in $W_k$ in
which $t_j$ is assigned value x.

Example 4 Consider an A-group $L_k$ in a published table
T*. Suppose there are four tuples, $t_1$, $t_2$, $t_3$ and $t_4$, with the
X values of x, x, y, y in $L_k$. Suppose the published table
T* satisfies 2-diversity.

Consider the QI based distribution G based on a certain
QI attribute set A which contains two possible signatures
$s_1$ and $s_2$. Table 5(a) shows the four global probabilities,
namely $p(s_1 : x) = 0.5$, $p(s_1 : y) = 0.5$, $p(s_2 : x) = 0.2$
and $p(s_2 : y) = 0.8$.

Suppose $t_1$, $t_2$, $t_3$ and $t_4$ match signatures $s_1$, $s_1$, $s_2$ and
$s_2$, respectively. There are six possible worlds w as shown
in Table 5(b). For example, the first row is the possible
world $w_1$ with mapping $\{t_1 : x, t_2 : x, t_3 : y, t_4 : y\}$.
The table also shows the values p(w) of the possible worlds.
Take the first possible world $w_1$ for illustration. From the QI
based distribution in Table 5(a), $p(s_1 : x) = 0.5$ and
$p(s_2 : y) = 0.8$. Hence, $p(w_1) = 0.5 \times 0.5 \times 0.8 \times 0.8 = 0.16$. The
sum of p(w) over all possible worlds in Table 5(b) is equal
to 0.16 + 0.04 + 0.04 + 0.04 + 0.04 + 0.01 = 0.33. Consider
$w_1$ again. Since $p(w_1) = 0.16$, $p(w_1|L_k) = 0.16/0.33 = 0.48$.

Suppose the adversary is interested in the probability that
$t_1$ is linked to x. We obtain $p(t_1 : x)$ as follows. $w_1$, $w_2$ and
$w_3$, as shown in Table 5(b), contain "$t_1 : x$". Thus, $p(t_1 : x)$
is equal to the sum of the probabilities $p(w_1|L_k)$, $p(w_2|L_k)$
and $p(w_3|L_k)$: $p(t_1 : x) = 0.48 + 0.12 + 0.12 = 0.72$. Note
that this is greater than 0.5, the intended upper bound under
2-diversity on the probability that an individual is linked to a
sensitive value.
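The computation in Example 4 can be reproduced by brute-force enumeration of the possible worlds, following Equations (1) to (3). The sketch below is only meant for small groups, since the number of worlds grows exponentially; the function name `linkage_probabilities` and the dictionary encoding of G are assumptions of the example. Without the intermediate rounding used in the text, the exact value of p(t_1 : x) is 0.24/0.33, approximately 0.727, rather than 0.72.

```python
from itertools import permutations


def linkage_probabilities(signatures, values, global_prob):
    """Return, for each tuple j, a dict mapping value x to p(t_j : x).

    signatures  : signature matched by each tuple of the A-group
    values      : multiset of sensitive values published for the group
    global_prob : dict (signature, value) -> p(s : x), the distribution G
    """
    # Distinct assignments of the multiset of values to the tuples.
    worlds = set(permutations(values))
    weight = {}
    for world in worlds:
        p = 1.0
        for sig, val in zip(signatures, world):
            p *= global_prob[(sig, val)]       # Equation (1)
        weight[world] = p
    total = sum(weight.values())               # denominator of Equation (2)
    result = [dict() for _ in signatures]
    for world, p in weight.items():
        for j, val in enumerate(world):        # Equation (3)
            result[j][val] = result[j].get(val, 0.0) + p / total
    return result


# Table 5(a) and the A-group of Example 4.
G = {("s1", "x"): 0.5, ("s1", "y"): 0.5, ("s2", "x"): 0.2, ("s2", "y"): 0.8}
probs = linkage_probabilities(["s1", "s1", "s2", "s2"], ["x", "x", "y", "y"], G)
print(round(probs[0]["x"], 4))  # p(t_1 : x), above the 0.5 bound of 2-diversity
```

The sketch assumes that at least one possible world has nonzero weight; otherwise the normalization in Equation (2) is undefined.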

4 Algorithm for Data Publishing 

Given the formulation of p(t : x), a naive approach 
for r-robustness is to adopt some known anonymization 
algorithm A and replace the probability measure in A by 
p{t : x). However, the complexity of computing p{t : x) is 
very high given the exponential number of possible worlds. 
Moreover, r-robustness is not monotone in the sense that an 
A-group that violates r-robustness may be split into small 
groups that are r-robust, while known top-down algorithms 
are based on monotone privacy conditions. 

This section presents an algorithm for generating an
r-robust table that overcomes the above problems. Section 4.1
first presents an important theoretical property for this
problem. Section 4.2 then describes our proposed algorithm,
ART.

4.1 Theoretical Property 

In Section 1, we observed that privacy is breached easily
whenever an individual in an A-group has a much higher
chance of linking to a sensitive value compared with
another individual in the A-group. For example, consider
the A-group L1 in Table 3. From the QI based distribution
(Table 4), it is more likely that an American is linked to
Heart Disease compared with a Japanese, so we can deduce that
Alex, an American, has Heart Disease with higher probability.
Note that the global probability of American linking
to Heart Disease, denoted by $f_1$, is 0.1 and the global
probability of Japanese linking to Heart Disease, denoted
by $f_2$, is 0.003. The difference in the global probabilities
is 0.1 - 0.003 = 0.097. Since the A-group size is small,
the difference gives some information that aids a privacy breach.
The difference in the global probabilities and the A-group
size are the properties of the A-group.

In the following, we have a theorem on the relationship
between the privacy guarantee and the properties of an
A-group L. Consider a tuple $t_v$ in an A-group L. We want
to show that, if the properties of L satisfy some conditions,
the privacy of $t_v$ can be guaranteed (i.e., $p(t_v : x) \le 1/r$).
The conditions essentially limit the deviations in the global
probabilities in terms of the group size.

In the following we consider the QI based distribution G
on a certain attribute set A. The algorithm to be described
later will consider multiple attribute sets.

Definition 4 (Greatest Probability Deviation Δ) Let L
be an A-group in T* with tuples $t_1, t_2, ..., t_N$ where N is
the group size and $N \ge r$. Let x be a sensitive value that
appears once in L. Without loss of generality, suppose
tuple $t_v$ matches signature $s_v$, $v \in [1, N]$. Thus, tuple $t_v$
has the QI based probability (or global probability) of linking
to x in L equal to $p(s_v : x) = f_v$.

Let $f_{max}$ be the greatest global probability in L (i.e.,
$f_{max} = \max_{v \in [1,N]} f_v$). The probability deviation of $t_v$
given $f_{max}$ is $\Delta_v = f_{max} - f_v$, $v \in [1, N]$.



Let us give some examples to illustrate the above notations.
In our running example of L1, the group size N is
equal to 2. In L1, the first tuple (Alex) is $t_1$ and the second
tuple (Bob) is $t_2$. Let $s_1$ = {"American"} and $s_2$ =
{"Japanese"}. Thus, $f_1 = 0.1$ and $f_2 = 0.003$. We know
that $t_1$ matches $s_1$ and $t_2$ matches $s_2$. Since $f_{max}$ is the
greatest global probability in L, $f_{max}$ is equal to 0.1 (because
$f_1 = 0.1$ and $f_2 = 0.003$). Thus, $\Delta_1 = f_{max} - f_1 =
0.1 - 0.1 = 0$ and $\Delta_2 = f_{max} - f_2 = 0.1 - 0.003 = 0.097$.

Theorem 1 Let r be the privacy parameter in r-robustness
where r > 1. Following the symbols in Definition 4, if for
all $v \in [1, N]$,

\[ \Delta_v \le \frac{(N - r) f_{max}}{f_{max}(r - 1)/(1 - f_{max}) + (N - 1)} \tag{4} \]

then for all $v \in [1, N]$, $p(t_v : x) \le 1/r$.

Proof: The proof is given in the appendix.

Definition 5 (Δ, $\Delta_{max}$) $\Delta_{max}$ is defined to be the R.H.S.
of Inequality (4). That is,

\[ \Delta_{max} = \frac{(N - r) f_{max}}{f_{max}(r - 1)/(1 - f_{max}) + (N - 1)} \]

Define $\Delta = \max_{v \in [1,N]} \{\Delta_v\}$.



N   r   f_max   Δ_max
3   2   0.1     0.0474
3   2   0.3     0.1235
3   2   0.5     0.1667
3   2   0.9     0.0818
4   2   0.3     0.1750
6   2   0.3     0.2211
6   3   0.3     0.1537
6   4   0.3     0.0955

Table 7. Values of Δ_max with some chosen
values of N, r and f_max



Hence, Δ is the greatest difference in the global probabilities
of linking to x in an A-group. Note that $\Delta \ge 0$. In our
running example, since $\Delta_1 = 0$ and $\Delta_2 = 0.097$, we have
Δ = max{0, 0.097} = 0.097.

Consider another example. Suppose an A-group L contains
three tuples matching $s_1$, $s_2$ and $s_3$ with the global
probabilities $f_1 = 0.1$, $f_2 = 0.08$ and $f_3 = 0.09$. Then, N = 3
and $f_{max} = 0.1$, so Δ = 0.1 - 0.08 = 0.02. Suppose r = 2.
The R.H.S. of (4) is $\Delta_{max}$ = (3 - 2) × 0.1/[0.1 × (2 -
1)/(1 - 0.1) + (3 - 1)] = 0.0474. Since Δ ≤ 0.0474, from
Theorem 1, for all tuples $t_v$ in L, $p(t_v : x) \le 1/r$ where
r = 2.
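The bound of Theorem 1 is cheap to evaluate, in contrast to the possible-world enumeration of Section 3. The following sketch computes $\Delta_{max}$ from Definition 5 and tests Inequality (4) for a group given only the global probabilities $f_v$ of its tuples. The function names are illustrative, and returning 0 when $f_{max} = 1$ is an assumption of this sketch that uses the limiting value of the right-hand side.

```python
def delta_max(n, r, f_max):
    """Right-hand side of Inequality (4) for group size n and parameter r."""
    if f_max >= 1.0:
        # Assumption of this sketch: use the limit of the bound as f_max -> 1.
        return 0.0
    return (n - r) * f_max / (f_max * (r - 1) / (1.0 - f_max) + (n - 1))


def satisfies_theorem_1(global_probs, r):
    """Test Inequality (4) for a group whose tuples have the given f_v values.

    The theorem additionally assumes that the sensitive value x occurs
    once in the group and that the group has at least r tuples.
    """
    n = len(global_probs)
    if n < r:
        return False
    f_max = max(global_probs)
    delta = f_max - min(global_probs)   # greatest deviation, Definition 5
    return delta <= delta_max(n, r, f_max)


# A few rows of Table 7, then the two worked examples from the text.
for n, r, f in [(3, 2, 0.1), (3, 2, 0.5), (6, 3, 0.3)]:
    print(n, r, f, round(delta_max(n, r, f), 4))
print(satisfies_theorem_1([0.1, 0.08, 0.09], 2))  # three-tuple example
print(satisfies_theorem_1([0.1, 0.003], 2))       # group L1 of Table 3
```

For the group L1 the check fails because N = r gives $\Delta_{max} = 0$ while Δ = 0.097, which agrees with the privacy breach described in Example 2.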

Let us consider the effects of the values of $f_{max}$ and
N to understand the physical meaning of Theorem 1. If
$f_{max} = 1$ or $f_{max} = 0$, then $\Delta \le 0$. Hence, the QI based
distributions of all tuples in L should be the same to
guarantee privacy.

Table 7 shows the values of $\Delta_{max}$ with some chosen
values of N, r and $f_{max}$. It can be seen that $\Delta_{max}$ is small
when $f_{max}$ is near the extreme values of 0 or 1, since the global
probability of a tuple is more pronounced.

Consider Inequality (4). If $N \to \infty$, then $\Delta \le f_{max}$.
Since $f_{max}$ is the greatest possible global probability in L,
it means that Δ can be any feasible value (i.e., $0 \le \Delta \le
f_{max}$). Therefore, when the A-group is extremely large,
under Theorem 1, there will be no privacy breach. When
N = r, $\Delta \le 0$. That is, the global probabilities of all
tuples in L should be equal. Otherwise, there may be a
privacy breach. Furthermore, N has the following relation
with $\Delta_{max}$.



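Lemma 1 $\Delta_{max}$ is a monotonic increasing function on N.

Proof: Let $f = f_{max}$. For r > 1 and 0 < f < 1,

\[ \frac{d\Delta_{max}}{dN} = \frac{(r - 1) \times \frac{f}{1 - f}}{\left[(r - 1) \times \frac{f}{1 - f} + (N - 1)\right]^2} > 0 \]

□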

From the above, in order to guarantee $p(t_v : x) \le 1/r$,
we can increase the size N of the A-group L. With a greater
value of N, the upper bound $\Delta_{max}$ increases, and the
constraint dictated by Inequality (4) is relaxed, making it
easier to reach the guarantee.

4.2 Algorithm ART 

Based on Theorem 1, we propose an Algorithm generating
an r-Robust Table, called ART. If an A-group L satisfies the
inequality in Theorem 1 with respect to attribute set A and,
in L, each sensitive value occurs at most once, we say that L
satisfies the QI based distribution bound condition with
respect to A. Otherwise, L violates the QI based distribution
bound condition.

In the algorithm, initially, each individual forms an
independent A-group. The algorithm repeatedly looks for any
A-group such that there exists an attribute set A where it
violates the QI based distribution bound condition with
respect to A. Such a group is merged with other existing
groups so that the resulting group satisfies the condition.
After merging, the number of tuples in L, N, is increased.
Then, by Lemma 1, $\Delta_{max}$ is also increased. The constraint
of Inequality (4) is relaxed and the group is more likely to satisfy
the QI based distribution bound condition. When a final
solution is reached, each individual is linked to any sensitive
value with probability at most 1/r.

Specifically, algorithm ART involves two major steps. 

• Step 1 (Individual A-group Formation): For each tuple 
t in the table T, we form an A-group L containing t 
only. 

• Step 2 (Merging): For each sensitive value x, while there exists an A-group L and an attribute set $\mathcal{A}$ such that L violates the QI based distribution bound condition with respect to $\mathcal{A}$, we find a set C of A-groups such that, after merging all A-groups in C with L, the merged A-group satisfies the QI based distribution bound condition with respect to any attribute set $\mathcal{A}$.

The idea of Step 2 is to keep the $\Delta$ value in L with respect to $\mathcal{A}$ unchanged or only slightly increased after merging. At the same time, we also make sure that each merged A-group contains at most one x for any sensitive value x. Before going into the details of Step 2, we need to define a new term. Given an A-group L, another A-group L' is called a closest A-group with respect to L if, after merging L' and L, the total increase in the value of $\Delta$ over all attribute sets is the smallest among all possible A-groups.

Definition 6 (Closest A-group) Suppose $\Delta_{before,\mathcal{A}}$ represents $\Delta$ with respect to an attribute set $\mathcal{A}$ in L and $\Delta_{after,\mathcal{A}}(L, L')$ represents $\Delta$ with respect to an attribute set $\mathcal{A}$ in the A-group obtained by merging L and L'.

Let $D_{\mathcal{A}}(L, L') = \Delta_{after,\mathcal{A}}(L, L') - \Delta_{before,\mathcal{A}}$.

Let $D(L, L') = \sum_{\mathcal{A}} D_{\mathcal{A}}(L, L')$.

L' is a closest A-group with respect to L if $D(L, L') = \min_{L''}\{D(L, L'')\}$.
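Definition 6 can be illustrated with a small sketch. It assumes a single fixed sensitive value x, represents each tuple as a dictionary mapping a (hypothetical) attribute-set name to its global probability, and takes $\Delta$ of a group to be the greatest difference between those probabilities within the group. The names `group_delta`, `merge_cost` and `closest_group` are illustrative and not taken from the paper.

```python
from typing import Dict, List, Sequence

TupleProbs = Dict[str, float]   # attribute-set name -> global probability for x
Group = List[TupleProbs]


def group_delta(group: Group, attr_set: str) -> float:
    """Greatest difference in global probability within the group."""
    values = [t[attr_set] for t in group]
    return max(values) - min(values)


def merge_cost(base: Group, other: Group, attr_sets: Sequence[str]) -> float:
    """D(L, L'): summed increase of Delta over all attribute sets after merging."""
    merged = base + other
    return sum(group_delta(merged, a) - group_delta(base, a) for a in attr_sets)


def closest_group(base: Group, candidates: List[Group],
                  attr_sets: Sequence[str]) -> Group:
    """Return a candidate that minimises D(L, L')."""
    return min(candidates, key=lambda c: merge_cost(base, c, attr_sets))


L = [{"A1": 0.20, "A2": 0.30}]
candidates = [[{"A1": 0.50, "A2": 0.30}], [{"A1": 0.25, "A2": 0.35}]]
best = closest_group(L, candidates, ["A1", "A2"])
```

In this toy input the second candidate is chosen, because merging it raises $\Delta$ by about 0.05 for each attribute set, whereas the first raises it by 0.30 for one of them.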



We are ready to describe Step 2 in detail. Let Y(L) be the set of sensitive values which appear in an A-group L. Given an A-group, it is easy to derive $\Delta$ and $f_{max}$. Note that r is a user parameter. After we know $\Delta$, $f_{max}$ and r, we can derive the expected minimum size of L based on the QI based distribution bound condition with respect to $\mathcal{A}$, denoted by $N_0$. By replacing N with $N_0$ and changing the subject of Inequality (4) in the QI based distribution bound condition to $N_0$, we have

$$N_0 \ge \frac{\frac{f_{max}(r-1)\Delta}{1-f_{max}} - \Delta + r f_{max}}{f_{max} - \Delta}$$

Let us choose the smallest integer $N_0$ such that the above inequality holds. We calculate $N_0$ for every attribute set $\mathcal{A}$ and choose the greatest value of $N_0$ as our final $N_0$. If the total number of tuples in L, N, is smaller than $N_0$, then we have to choose an additional $N_0 - N$ tuples to be merged with L. We choose a closest A-group L' with respect to L where L' does not contain any sensitive value in Y(L). L' is merged with L, and $\Delta$, $f_{max}$ and $N_0$ are updated accordingly. If the updated N value is still smaller than $N_0$, then we repeat the above process.
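The computation of the minimum group size can be sketched as follows. This is an illustration under the assumption that the bound on $N_0$ is obtained by rearranging $\Delta \le (N-r) f_{max} / [(r-1) f_{max}/(1-f_{max}) + (N-1)]$; the helper names `min_group_size` and `required_size` and the small rounding tolerance are assumptions of this sketch.

```python
import math
from typing import Iterable, Tuple


def min_group_size(delta: float, f_max: float, r: int) -> int:
    """Smallest integer N0 satisfying the rearranged bound for one attribute set."""
    if not 0.0 <= delta < f_max < 1.0:
        raise ValueError("expected 0 <= delta < f_max < 1")
    bound = (f_max * (r - 1) * delta / (1.0 - f_max) - delta + r * f_max) / (f_max - delta)
    # Subtract a tiny tolerance so floating-point noise does not push N0 up by one.
    return math.ceil(bound - 1e-9)


def required_size(stats: Iterable[Tuple[float, float]], r: int) -> int:
    """Greatest N0 over all attribute sets, given (delta, f_max) pairs."""
    return max(min_group_size(d, f, r) for d, f in stats)


n0_equal = min_group_size(0.0, 0.2, r=10)     # all global probabilities equal
n0_spread = min_group_size(0.05, 0.2, r=10)   # some spread within the group
n0_final = required_size([(0.0, 0.2), (0.05, 0.2)], r=10)
```

With equal global probabilities the required size is exactly r, and any spread within the group pushes the required size above r, which is why merging continues until N reaches $N_0$.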

Theorem 2 Any table T* generated by Algorithm ART is r-robust.

5 Empirical Study 

A Pentium IV 2.2GHz PC with 1GB RAM was used to conduct our experiments. The algorithm was implemented in C/C++. We adopted the publicly available dataset, Adult Database, from the UC Irvine Machine Learning Repository [4]. This dataset (5.5MB) was also adopted by [13, 18]. We used a configuration similar to [13, 18]. The records with unknown values were first eliminated, resulting in a dataset with 45,222 tuples (5.4MB). Nine attributes were chosen in our experiment, namely Age, Work Class, Marital Status, Occupation, Race, Sex, Native Country, Salary Class and Education. By default, we chose the first five attributes and the last attribute as the quasi-identifier and the sensitive attribute, respectively. Similar to [18], in attribute "Education", all values representing the education levels before "secondary" (or "9th-10th") such as "1st-4th", "5th-6th" and "7th-8th" are regarded as a sensitive value set, where an adversary checks whether each individual is linked to this set with probability more than 1/r, where r is a parameter.

In the dataset, 3.46% of the tuples have education levels before "secondary". Since there is a set Q of multiple QI based distributions G, we can calculate $p(t : x)$ for different G's and different x's. We take the greatest such value to report as the probability that individual t is linked to some sensitive value since this corresponds to the worst case privacy breach.
We compared our proposed algorithm ART with four algorithms: Anatomy [20], MASK [18], Injector [10] and t-closeness [9]. They are selected because they consider l-diversity or similar privacy requirements, so we need only set l = r. We are interested in the overhead required by our approach in order to achieve r-robustness. When we compared ART with Anatomy, we set l = r. When we compared it with MASK, the parameters k and m used in MASK were set to r. For Injector, the parameters minConf, minExp and l were set to 1, 0.9 and r, respectively, which are the default settings in [10]. For t-closeness, similar to [9], we set t = 0.2. We evaluated the algorithms in terms of four measurements: (1) execution time, (2) relative error ratio, (3) the proportion of problematic tuples among all sensitive tuples and (4) the average value of $\Delta$.

(1) Execution time: We measured the execution time of the algorithms.

(2) Relative error ratio: As in [20, 18, 10], we measure the error by the relative error ratio in answering an aggregate query. We adopt both the form of the aggregate query and the parameters of the query dimensionality qd and the expected query selectivity s from [20, 18, 10]. For each evaluation in the case of two anonymized tables, we performed 10,000 queries and then reported the average relative error ratio. By default, we set s = 0.05 and qd to be the QI size.

(3) Proportion of problematic tuples among all sensitive tuples: According to the probability formulation in Section 3, from the anonymized table generated by each algorithm we can calculate the probability that a tuple is linked to a sensitive value set. If this probability is greater than 1/r, the tuple is said to be a problematic tuple. The tuples linking to sensitive values in the original table are called sensitive tuples. In our experiments, we measure the proportion of problematic tuples among all sensitive tuples.

(4) Average value of $\Delta$: The average value of $\Delta$ is evaluated with respect to every attribute set $\mathcal{A}$ containing large samples. Consider a sensitive value x. With respect to a certain attribute set $\mathcal{A}$, the average value of $\Delta$, denoted by $\bar{\Delta}_{\mathcal{A}}$, is equal to $\frac{1}{u}\sum_{L \in T^*} \Delta_L$, where u is the total number of A-groups in T* and $\Delta_L$ is the greatest difference in the global probability linking to a sensitive value x with respect to $\mathcal{A}$ in an A-group L. Let B be the set of all attribute sets $\mathcal{A}$ containing large samples. With respect to every attribute set in B, the average value of $\Delta$ is equal to $\frac{1}{|B|}\sum_{\mathcal{A} \in B} \bar{\Delta}_{\mathcal{A}}$. We perform the same steps for every sensitive value x and take the average as the reported average value of $\Delta$.

For each measurement, we conducted the experiments 100 times and took the average.
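Measurement (4) can be sketched in a few lines. This is a minimal illustration for one fixed sensitive value x, assuming each tuple is a dictionary from a hypothetical attribute-set name to its global probability and that an anonymized table is a list of A-groups; the names `group_delta` and `average_delta` are not from the paper.

```python
from statistics import mean
from typing import Dict, List, Sequence

Group = List[Dict[str, float]]   # tuples: attribute-set name -> global probability


def group_delta(group: Group, attr_set: str) -> float:
    """Greatest difference in global probability within one A-group."""
    values = [t[attr_set] for t in group]
    return max(values) - min(values)


def average_delta(table: List[Group], attr_sets: Sequence[str]) -> float:
    """Average over attribute sets of the per-group average of Delta."""
    per_attr_set = [
        mean([group_delta(g, a) for g in table])   # average over the u A-groups
        for a in attr_sets
    ]
    return mean(per_attr_set)                      # average over the set B


table = [
    [{"A1": 0.1}, {"A1": 0.3}],   # Delta = 0.2
    [{"A1": 0.2}, {"A1": 0.2}],   # Delta = 0.0
]
avg = average_delta(table, ["A1"])
```

For the reported figure, the same computation would be repeated for every sensitive value x and the results averaged.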

We conducted the experiments by varying four factors: 
(1) the QI size, (2) r, (3) query dimensionality qd and (4) 
selectivity s. 

Figure 1 shows the results when r is set to 10. Figure 1(a) shows that the execution time increases with the QI size because the algorithms have to process more QI attributes. ART runs slower than Anatomy, MASK and t-closeness. Since ART has to compute the QI based distribution with respect to every attribute set, its execution time grows faster as the QI size increases.

[Figure 1. Effect of QI size (r = 10). Panels (a)-(d). Legend: ART, Anatomy, MASK, Injector, t-closeness, Theoretical Bound.]

Figure 1(b) shows that the average relative error increases with the QI size, because with a larger QI size it is more difficult to form A-groups in which the difference in QI based distributions among all tuples is small. Since t-closeness uses global recoding and causes many unnecessary generalizations, its average relative error is the largest. Since Injector tries to exclude some sensitive values in an A-group, its relative error is also small.

Figure 1(c) shows that the proportion of problematic tuples among sensitive tuples increases with the QI size. With a larger QI size, there is a higher chance of individual privacy breaches because more attributes can be used to construct the QI based distributions. MASK has fewer privacy breaches than t-closeness, Anatomy and Injector because the minimization of QI values in each A-group adopted in MASK has the side effect of making the difference in the QI based distribution among all tuples in each A-group smaller. Thus, the number of individuals with privacy breaches is smaller. Note that there is no violation in ART.

In Figure 1(d), we include the theoretical bound $\Delta_{max}$ from Theorem 1 for comparison. We use the bound of ART as this theoretical bound because, compared with Anatomy and Injector, the A-groups formed in ART are the largest (which yields the largest bound). Since the average values of $\Delta$ of Anatomy and Injector are greater than this bound, they may have privacy breaches, as shown in Figure 1(c). When the QI size increases, the average value of $\Delta$ with respect to every attribute set increases, as shown in Figure 1(d). With a larger QI size, when forming an A-group, we have to consider $\Delta$ with respect to more attribute sets. Thus, it is more likely that an A-group has a larger average value of $\Delta$ with respect to every attribute set. The average value of $\Delta$ is the largest in Anatomy and Injector, followed by MASK and t-closeness. This is because Anatomy and Injector do not take our QI based distribution directly into consideration for merging, but MASK and t-closeness do so indirectly during the minimization of QI values. In Figure 1(d), although the average value of $\Delta$ of MASK is smaller than the theoretical bound of $\Delta$, privacy can still be breached, as shown in Figure 1(c), because this evaluation only shows the average value and the actual $\Delta$ in some A-groups is larger than this bound.

We also conducted experiments when r = 2. For the sake of space, we do not show the figures. The results are similar, but the execution time and the average relative error are smaller. Since r is smaller and thus 1/r is larger, the average value of $\Delta$ is larger when r = 2.

6 Related Work 

With respect to attribute types considered for data anonymization, there are two branches of study. The first branch is anonymization according to the QI attributes. A typical model is k-anonymity [2]. The other branch is the consideration of both QI attributes and sensitive attributes.
Some examples are IH, |[l9l, IS), flOl and 0. In this 
paper, we focus on this branch. We want to check whether 
the probability that each individual is linked to any sensitive 
value is at most a given threshold. 

l-diversity [13] proposes a model where l is a positive integer and each A-group contains l "well-represented" values in the sensitive attribute. For t-closeness [9], the distribution in each A-group in T* with respect to the sensitive attribute is roughly equal to the distribution of the entire table T*.

In the literature, different kinds of background knowledge are considered l[T3][i3[T8l[l2ll21[l0l[ll. 01 considers another kind of background knowledge in the form of implications. [18] discovers that the minimality principle of the anonymization algorithm can also be used as background knowledge. [12] proposes to use the kernel estimation method to mine the background knowledge from the original table.

[10] finds that association rules can be mined from the original table and thus can be used for privacy protection during anonymization. In [1], the problem of privacy attack by adversarial association rule mining is investigated. However, as pointed out in [16], association rules used in [10] and [1] can contradict the true statistical properties. Also, the solution in [1] is to invalidate the rules, but this will violate the data mining objectives of data publication.



7 Conclusion 

In this paper, we consider the worst-case QI based distribution for privacy-preserving data publishing. Then, we derive a theoretical property and propose an algorithm which generates a dataset protecting individual privacy in the presence of the worst-case QI based distribution. Finally, we conducted experiments to show that our proposed algorithm is efficient and incurs low information loss. For future work, we plan to investigate how to anonymize the dataset with other kinds of background knowledge that may be possessed by the adversary.

References 

[1] C. C. Aggarwal, J. Pei, and B. Zhang. On privacy preservation against adversarial data mining. In KDD, 2006.

[2] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Anonymizing tables. In ICDT, 2005.

[3] M. Barbaro and T. Zeller Jr. A face is exposed for AOL searcher no. 4417749. In New York Times, 2006.

[4] E. K. C. Blake and C. J. Merz. UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.

[5] J. Brickell and V. Shmatikov. The cost of privacy: Destruction of data-mining utility in anonymized data publishing. In KDD, 2008.

[6] B.-C. Chen, K. LeFevre, and R. Ramakrishnan. Privacy skyline: Privacy with multidimensional adversarial knowledge. In VLDB, 2007.

[7] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, 2008.

[8] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In ICDE, 2006.

[9] N. Li and T. Li. t-closeness: Privacy beyond k-anonymity and l-diversity. In ICDE, 2007.

[10] T. Li and N. Li. Injector: Mining background knowledge for data anonymization. In ICDE, 2008.

[11] T. Li and N. Li. On the tradeoff between privacy and utility in data publishing. In SIGKDD, 2009.

[12] T. Li, N. Li, and J. Zhang. Modeling and integrating background knowledge in data anonymization. In ICDE, 2009.

[13] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy beyond k-anonymity. In ICDE, 2006.

[14] D. J. Martin, D. Kifer, A. Machanavajjhala, and J. Gehrke. Worst-case background knowledge for privacy-preserving data publishing. In ICDE, 2007.

[15] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Unpublished manuscript, 1998.

[16] C. Silverstein, R. Motwani, and S. Brin. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD, 1997.

[17] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 2002.

[18] R. Wong, A. Fu, K. Wang, and J. Pei. Minimality attack in privacy preserving data publishing. In VLDB, 2007.

[19] R. Wong, J. Li, A. Fu, and K. Wang. (alpha, k)-anonymity: An enhanced k-anonymity model for privacy-preserving data publishing. In KDD, 2006.

[20] X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In VLDB, 2006.

[21] Q. Zhang, N. Koudas, D. Srivastava, and T. Yu. Aggregate query answering on anonymized tables. In ICDE, 2007.



8 Appendix 

Here we prove our main theorem. Let us recap a few notations. 



$p(s_i : x)$: probability that signature $s_i$ is linked to x
$f_i$: a simplified notation for $p(s_i : x)$
$f$: $f_{max}$, the maximum $f_i$ value among all i's



Proof of Theorem 1. Let $t_u$ be a tuple in L with the greatest global probability linking to x in L (i.e., for all tuples $t_v$ in L, $f_u \ge f_v$). Besides, $f = f_u$.

Consider the set $W_u$ of possible worlds where "$t_u : x$" occurs. Let $t_v$ be a tuple such that $\Delta_v = \max_{a \in [1,N]}\{\Delta_a\}$. Consider the set of possible worlds $W_v$ where "$t_v : x$" occurs.

Consider also the set of possible worlds $W_a$ where "$t_a : x$" occurs for an arbitrary $t_a$ where $t_a \ne t_v$. We first want to show that $p(W_a) \ge p(W_v)$, where $p(W_a)$ is the probability that any world in $W_a$ occurs.

Lemma 2 For $a \in [1, N]$, $p(W_a) \ge p(W_v)$.

Proof of Lemma 2. Since $\Delta_v = \max_{a \in [1,N]}\{\Delta_a\}$, $f_v \le f_a$ and $(1 - f_v) \ge (1 - f_a)$. Hence,

$$f_a(1 - f_v) \ge f_v(1 - f_a) \qquad (6)$$



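For a world $w_v \in W_v$, $p(w_v) = p_{1,w_v} \times \cdots \times p_{N,w_v}$. For a world $w_a \in W_a$, $p(w_a) = p_{1,w_a} \times \cdots \times p_{N,w_a}$. Note that $p_{v,w_v} = f_v$ and $p_{a,w_a} = f_a$.

Since there is only one x occurrence in L, $t_v$ is not assigned x in any $w_a \in W_a$. Let $W'_a$ be a maximal subset of $W_a$ where $t_v$'s are assigned to distinct sensitive values. Obviously $\sum_{w_a \in W'_a} p_{v,w_a} = 1 - f_v$. Hence,

$$\sum_{w_a \in W'_a} p_{a,w_a} \times p_{v,w_a} = f_a(1 - f_v) \qquad (7)$$

Similarly, since $t_a$ is not assigned x in any $w_v \in W_v$, we can find a maximal subset $W'_v$ of $W_v$ where $t_a$'s are assigned to distinct sensitive values, and we have $\sum_{w_v \in W'_v} p_{a,w_v} = 1 - f_a$. Hence,

$$\sum_{w_v \in W'_v} p_{v,w_v} \times p_{a,w_v} = f_v(1 - f_a) \qquad (8)$$

From (6), (7) and (8),

$$\sum_{w_a \in W'_a} p_{a,w_a} \times p_{v,w_a} \ge \sum_{w_v \in W'_v} p_{v,w_v} \times p_{a,w_v} \qquad (9)$$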

For each $w_a \in W'_a$ we can find a unique $w_v$ in $W'_v$ so that $t_v$ in $w_a$ and $t_a$ in $w_v$ are assigned the same sensitive value. We say that $w_a$ and $w_v$ are matching. Let us further restrict $W'_v$ based on $W'_a$ in such a way that the matching world $w_v$ in $W'_v$ for $w_a$ in $W'_a$ has the same sensitive value assignments for the remaining tuples. It is obvious that we can always form such a $W'_v$ from $W_v$ and any $W'_a$. For matching $w_a$ and $w_v$,

$$\prod_{j \notin \{a,v\}} p_{j,w_a} = \prod_{j \notin \{a,v\}} p_{j,w_v} \qquad (10)$$



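Furthermore, $W_a$ can be partitioned into $W'_a$'s, and the union of the corresponding $W'_v$'s is equal to $W_v$. From (9) and (10), we conclude that

$$\sum_{w_a \in W_a} p_{1,w_a} \times \cdots \times p_{N,w_a} \ge \sum_{w_v \in W_v} p_{1,w_v} \times \cdots \times p_{N,w_v}$$

That is, $\sum_{w_a \in W_a} p(w_a) \ge \sum_{w_v \in W_v} p(w_v)$. Therefore, for $a \in [1, N]$,

$$p(W_a) \ge p(W_v) \qquad (11)$$

This completes the proof of Lemma 2. □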



Lemma 3 If $p(t_u : x) \le 1/r$, then $p(t_a : x) \le 1/r$ for all $a \in [1, N]$.

Proof of Lemma 3. By similar techniques used in the proof of Lemma 2, since $f_u \ge f_a$ for all $a \in [1, N]$, we derive that $p(W_u) \ge p(W_a)$. Let $K = \sum_{w' \in W} p(w')$ where W is the set of all possible worlds. Since $p(t_u : x) = p(W_u|L) = p(W_u)/K$ and $p(t_a : x) = p(W_a|L) = p(W_a)/K$, we have $p(t_u : x) \ge p(t_a : x)$. Thus, if $p(t_u : x) \le 1/r$, then, for all $a \in [1, N]$, $p(t_a : x) \le 1/r$.

This completes the proof of Lemma 3. □

Lemma 3 suggests that, once $p(t_u : x)$ is bounded by 1/r, all other probabilities $p(t_a : x)$ in the A-group are also bounded. In the following, we focus on analyzing $p(t_u : x)$ only (instead of all probabilities $p(t_a : x)$).

Consider $p(t_u : x)$, which is equal to $p(W_u|L)$. Let W be the set of all possible worlds. Let $W^{(t_u : x)}$ be the set of all possible worlds with "$t_u : x$". By definition $W^{(t_u : x)} = W_u$ and there are N such sets of worlds in W. Also,

$$p(W_u|L) = \frac{\sum_{w \in W_u} p(w)}{\sum_{w \in W} p(w)} = \frac{p(W_u)}{p(W_u) + \sum_{a \ne u} p(W_a)}$$

By Lemma 2, $\sum_{a \ne u} p(W_a) \ge (N - 1)\,p(W_v)$. Hence,

$$p(W_u|L) \le \frac{p(W_u)}{p(W_u) + (N - 1)\,p(W_v)} \qquad (12)$$



From the proof of Lemma 2, $W_u$ and $W_v$ can be partitioned into matching pairs of $W'_u$ and $W'_v$ where $\sum_{w_u \in W'_u} p(w_u) = f_u(1 - f_v)C$ for some C and $\sum_{w_v \in W'_v} p(w_v) = f_v(1 - f_u)C$.

Therefore, we can simplify Inequality (12) as follows.

$$p(W_u|L) \le \frac{f_u(1 - f_v)}{f_u(1 - f_v) + (N - 1) \times f_v(1 - f_u)} \qquad (13)$$
Consider the term $(N - 1) \times f_v(1 - f_u)$ in Inequality (13).

$$\begin{aligned}
(N-1)(1-f_u)f_v &= (N-1)(1-f)(f-\Delta_v) \\
&= \frac{(N-1)(1-f)(f-\Delta_v)}{(r-1)f(1-f+\Delta_v)} \times (r-1)f(1-f+\Delta_v) \\
&= \frac{(N-1)(1-f)(f-\Delta_v)}{(r-1)f(1-f+\Delta_v)} \times (r-1)f_u(1-f_v)
\end{aligned}$$

After substituting $\Delta_v \le (N - r)f / [(r-1) \times \frac{f}{1-f} + (N - 1)]$ into the above equation, with simple derivations, we obtain

$$(N - 1) \times f_v(1 - f_u) \ge (r - 1) \times f_u(1 - f_v)$$

With the above inequality, Inequality (13) becomes

$$p(W_u|L) \le 1/r \qquad (14)$$

This completes the proof of Theorem 1. □
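The last step of the proof can be sanity-checked numerically. The sketch below assumes the right-hand side of Inequality (13) with $f_u = f$ and $f_v = f - \Delta_v$, and the bound $\Delta_{max} = (N-r)f / [(r-1)f/(1-f) + (N-1)]$; the function names and sample values are illustrative only.

```python
def delta_max(n: int, r: int, f: float) -> float:
    """Assumed upper bound on Delta from Theorem 1."""
    return (n - r) * f / ((r - 1) * f / (1.0 - f) + (n - 1))


def breach_bound(n: int, f: float, delta: float) -> float:
    """Right-hand side of Inequality (13) with f_u = f and f_v = f - delta."""
    f_u, f_v = f, f - delta
    numerator = f_u * (1.0 - f_v)
    return numerator / (numerator + (n - 1) * f_v * (1.0 - f_u))


n, r, f = 50, 10, 0.2
d_max = delta_max(n, r, f)
at_bound = breach_bound(n, f, d_max)          # expected to equal 1/r
below_bound = breach_bound(n, f, 0.5 * d_max) # smaller spread, smaller probability
above_bound = breach_bound(n, f, 0.19)        # spread beyond the bound
```

With these sample values the right-hand side of (13) equals 1/r exactly at $\Delta_v = \Delta_{max}$, falls below 1/r for a smaller $\Delta_v$, and exceeds 1/r once $\Delta_v$ goes beyond the bound.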